The IODS course offers training in data wrangling and analysis in an intensive but pleasant way. DataCamp offered a fun way to learn more about R, and working on GitHub showed the great potential of this platform.
This week, data wrangling, regression and model validation have been central in this course.
data_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt"
lrn14 <-read.table(data_url, sep=",", header = TRUE)
dim(lrn14)
## [1] 166 7
str(lrn14)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data consists of 166 students measured on the 7 variables listed above.
summary(lrn14)
## gender age attitude deep stra
## F:110 Min. :17.00 Min. :1.400 Min. :1.583 Min. :1.250
## M: 56 1st Qu.:21.00 1st Qu.:2.600 1st Qu.:3.333 1st Qu.:2.625
## Median :22.00 Median :3.200 Median :3.667 Median :3.188
## Mean :25.51 Mean :3.143 Mean :3.680 Mean :3.121
## 3rd Qu.:27.00 3rd Qu.:3.700 3rd Qu.:4.083 3rd Qu.:3.625
## Max. :55.00 Max. :5.000 Max. :4.917 Max. :5.000
## surf points
## Min. :1.583 Min. : 7.00
## 1st Qu.:2.417 1st Qu.:19.00
## Median :2.833 Median :23.00
## Mean :2.787 Mean :22.72
## 3rd Qu.:3.167 3rd Qu.:27.75
## Max. :4.333 Max. :33.00
The group consists of 110 females and 56 males. Their median age is 22 years with range [17-55]. The students scored a median of 3.2 with range [1.4-5] for attitude, a median of 3.7 with range [1.6-4.9] for deep learning approach (deep), a median of 3.2 with range [1.3-5] for strategic learning approach (stra), a median of 2.8 with range [1.6-4.3] for surface learning approach (surf) and a median of 23 exam points (points) with range [7-33].
pairs(lrn14[-1], col=lrn14$gender)
The graphical summary above shows the relationships between the different measurements as bivariate scatter plots. The colour of the points indicates sex: females are plotted as black dots and males as red dots (the first and second factor levels).
library(GGally)
library(ggplot2)
p <- ggpairs(lrn14, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))
p
The graphic above displays the data in more ways than the bivariate scatter plots. The distributions of the variables are roughly normal, except for age and deep, which are respectively right-skewed and left-skewed. The strongest relationship in the data is between attitude toward statistics and final exam points, with a positive correlation of about 0.4 for both sexes. The second strongest correlation, between the deep and surface learning approaches, is negative: -0.6 within the male group and -0.3 for the whole data.
model_1 <-lm(points ~ attitude + stra + deep, data = lrn14)
summary(model_1)
##
## Call:
## lm(formula = points ~ attitude + stra + deep, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.5239 -3.4276 0.5474 3.8220 11.5112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.3915 3.4077 3.343 0.00103 **
## attitude 3.5254 0.5683 6.203 4.44e-09 ***
## stra 0.9621 0.5367 1.793 0.07489 .
## deep -0.7492 0.7507 -0.998 0.31974
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 162 degrees of freedom
## Multiple R-squared: 0.2097, Adjusted R-squared: 0.195
## F-statistic: 14.33 on 3 and 162 DF, p-value: 2.521e-08
In the first model, attitude is the only variable with a clearly significant, low p-value. The "deep" variable, which has the highest p-value, is removed and the model is fitted again.
model_2 <-lm(points ~ attitude + stra, data = lrn14)
summary(model_2)
##
## Call:
## lm(formula = points ~ attitude + stra, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.6436 -3.3113 0.5575 3.7928 10.9295
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.9729 2.3959 3.745 0.00025 ***
## attitude 3.4658 0.5652 6.132 6.31e-09 ***
## stra 0.9137 0.5345 1.709 0.08927 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.289 on 163 degrees of freedom
## Multiple R-squared: 0.2048, Adjusted R-squared: 0.1951
## F-statistic: 20.99 on 2 and 163 DF, p-value: 7.734e-09
In this model, attitude is again the only significant variable explaining variation in the exam points. The model is fitted again without the "stra" variable.
model_3 <-lm(points ~ attitude, data = lrn14)
summary(model_3)
##
## Call:
## lm(formula = points ~ attitude, data = lrn14)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
Residuals: The residuals are the differences between the values predicted by the model and the actual values. A well-fitting model should have a roughly symmetric distribution of residuals centred on zero.
Coefficients: The intercept is the expected mean value of Y when all X = 0. The slope of attitude is estimated at 3.53, meaning that the expected exam points increase by about 3.5 for every one-point increase in attitude. The standard error measures the uncertainty of each estimate; roughly 95 % of the time the true coefficient lies within about two standard errors of the estimate. Pr(>|t|) is the probability of observing a t value at least as extreme as the one computed, if the true coefficient were zero. A small p-value indicates that it is unlikely that the observed relationship between the predictor (attitude) and the response (points) is due to chance. Three asterisks mark a highly significant p-value.
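As a sketch of how these coefficients translate into a prediction (11.6372 and 3.5255 are the model_3 estimates reported above; the attitude value is a hypothetical student, not from the data):

```r
# Fitted value built by hand from the model_3 estimates above.
intercept <- 11.6372
slope <- 3.5255
attitude <- 4.0                           # hypothetical student
predicted_points <- intercept + slope * attitude
predicted_points                          # about 25.7 exam points
```

In practice the same value would come from predict(model_3, newdata = data.frame(attitude = 4.0)).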
Residual standard error: The residual standard error is a measure of the quality of the linear regression fit: the average amount by which the response (points) deviates from the true regression line.
R-squared statistic provides a measure of how well the model fits the actual data. A value of 0 means that the predictor does not explain any of the variance in the response; 1 means it explains all of it. In this case roughly 19 % of the variance in the response variable (points) is explained by the predictor variable (attitude).
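The definition of R-squared can be verified by hand; a minimal sketch on toy data (not the course data):

```r
# R-squared = 1 - (residual sum of squares) / (total sum of squares)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
ss_res <- sum(residuals(fit)^2)       # variation left unexplained by the fit
ss_tot <- sum((y - mean(y))^2)        # total variation in y
r2 <- 1 - ss_res / ss_tot
all.equal(r2, summary(fit)$r.squared) # TRUE
```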
F-statistic is an indicator of whether there is a relationship between the predictors and the response. The larger the F-statistic, the stronger the evidence for rejecting the null hypothesis (H0: "There is no relationship between attitude and points.").
plot(model_3, which = c(1:2, 5))
Residuals vs Fitted values: The residuals are spread fairly evenly around zero with no clear pattern, supporting the assumptions of linearity and constant error variance; the linear model fits this data reasonably well. A few outliers can be observed.
Normal Q-Q: The standardized residuals mainly follow the straight line of the theoretical quantiles, supporting the assumption that the errors are normally distributed. Some deviation is observed in the lower left corner.
Residuals vs Leverage: Some outliers could be identified in the residuals vs fitted values plot, but they do not have high leverage, so the model can be used to describe the relationship between 'attitude' and 'points'. The outliers can therefore be kept in the model. If they were due to registration errors in the data and had high leverage on the model, they could be removed.
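The same influence checks can also be done numerically instead of visually; a sketch using the built-in cars data (not the course data):

```r
# Leverage and Cook's distance flag influential observations without plots.
fit <- lm(dist ~ speed, data = cars)
lev <- hatvalues(fit)                    # leverage of each observation
cook <- cooks.distance(fit)              # influence on the fitted coefficients
head(sort(cook, decreasing = TRUE), 3)   # the three most influential rows
```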
This week, data wrangling, logistic regression, odds ratios and prediction have been central in this course.
data_url <- "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/alc.txt"
alc <-read.table(data_url, sep=",", header = TRUE)
dim(alc)
## [1] 382 35
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data consists of 382 observations of the 35 variables listed above.
I would choose sex, quality of family relationships, going out with friends and the willingness to pursue higher education as variables for further investigation. I would expect male students who go out a lot to use more alcohol. Willingness to pursue higher education and a good family relationship could intuitively be associated with low alcohol consumption.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
ggplot(alc, aes(x = high_use, y = age, col = sex)) + geom_boxplot() + ylab("age") + ggtitle("Student age by high alcohol use and sex")
ggplot(alc, aes(x = high_use, y = goout, col = sex)) + geom_boxplot() + ylab("Going out") + ggtitle("Going out with friends by high alcohol use and sex")
ggplot(data = alc, aes(x =goout, fill = high_use)) + geom_bar() + xlab("Going out") + ggtitle("Going out with friends and high alcohol use distribution")
ggplot(alc, aes(x = high_use, y = famrel, col = sex)) + geom_boxplot() + ylab("Family relationship") + ggtitle("Family relationship by high alcohol use and sex")
ggplot(data = alc, aes(x = famrel, fill = high_use)) + geom_bar() + xlab("Family relationship") + ggtitle("Family relationship and high alcohol use distribution")
table(alc$high_use, alc$sex)
##
## F M
## FALSE 157 113
## TRUE 41 71
alc %>% group_by(famsup) %>% summarise(count = n(), mean_alc_use = mean(alc_use))
## # A tibble: 2 x 3
## famsup count mean_alc_use
## <fctr> <int> <dbl>
## 1 no 144 1.954861
## 2 yes 238 1.829832
alc %>% group_by(alc_use) %>% summarise(count = n(), mean_goout = mean(goout))
## # A tibble: 9 x 3
## alc_use count mean_goout
## <dbl> <int> <dbl>
## 1 1.0 144 2.736111
## 2 1.5 68 2.882353
## 3 2.0 58 3.120690
## 4 2.5 42 3.309524
## 5 3.0 32 4.000000
## 6 3.5 17 3.764706
## 7 4.0 9 4.222222
## 8 4.5 3 3.666667
## 9 5.0 9 4.222222
Student age by high alcohol use and sex: In the boxplot, high alcohol use among men is associated with a slightly higher age. For the whole group there does not seem to be a clear relationship based on the mean ages.
Going out with friends by high alcohol use and sex: There seems to be a relationship between going out frequently with friends and high alcohol use.
Family relationship by high alcohol use and sex: A good family relationship seems to protect against high use of alcohol.
library(tidyr); library(dplyr)
table(alc$higher, alc$high_use) %>% prop.table %>% addmargins()*100
##
## FALSE TRUE Sum
## no 2.356021 2.356021 4.712042
## yes 68.324607 26.963351 95.287958
## Sum 70.680628 29.319372 100.000000
In the cross-tabulation above between the willingness to pursue higher education ("no", "yes") and high use of alcohol (FALSE, TRUE), the numbers are expressed as percentages of the whole study material. Students with high alcohol use do not seem to pursue higher education in the same proportion as students who drink less.
library(tidyr)
alc$famrel <- factor(alc$famrel)
alc$goout <- factor(alc$goout)
m <- glm(high_use ~ alc$famrel + alc$goout + higher + sex, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ alc$famrel + alc$goout + higher + sex,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5569 -0.7303 -0.4874 0.7327 2.4733
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.9444 1.2071 -1.611 0.107226
## alc$famrel2 0.3801 1.0768 0.353 0.724127
## alc$famrel3 0.4501 0.9709 0.464 0.642952
## alc$famrel4 -0.3165 0.9489 -0.334 0.738730
## alc$famrel5 -0.9362 0.9710 -0.964 0.334934
## alc$goout2 0.2091 0.6940 0.301 0.763227
## alc$goout3 0.5296 0.6768 0.782 0.433976
## alc$goout4 2.0342 0.6753 3.012 0.002594 **
## alc$goout5 2.4735 0.6968 3.550 0.000385 ***
## higheryes -0.3391 0.5666 -0.598 0.549523
## sexM 0.9851 0.2635 3.739 0.000185 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 462.21 on 381 degrees of freedom
## Residual deviance: 378.16 on 371 degrees of freedom
## AIC: 400.16
##
## Number of Fisher Scoring iterations: 4
According to the logistic regression model, males are significantly more prone to high alcohol use, and going out frequently is also significantly associated with it. The quality of the family relationship and the willingness to pursue higher education are not significantly associated with alcohol consumption in this model.
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.1430801 0.01029021 1.270211
## alc$famrel2 1.4623806 0.20016974 15.231512
## alc$famrel3 1.5684279 0.27441876 13.984373
## alc$famrel4 0.7286997 0.13357043 6.261852
## alc$famrel5 0.3921018 0.06771719 3.457162
## alc$goout2 1.2325383 0.35085792 5.807814
## alc$goout3 1.6982026 0.50605996 7.820783
## alc$goout4 7.6459609 2.29792372 35.198633
## alc$goout5 11.8638973 3.39834985 56.396606
## higheryes 0.7124103 0.23173686 2.193897
## sexM 2.6782042 1.60825469 4.529032
The odds ratios show 2.7 times higher odds of high alcohol consumption for males. Going out frequently (4) or very frequently (5) multiplies the odds of high alcohol consumption by 7.6 and 11.9 respectively, relative to the lowest going-out level.
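The link between the model coefficients and these odds ratios is just exponentiation; a sketch using the sexM estimate (0.9851) reported above:

```r
# An odds ratio is the exponentiated log-odds (logit) coefficient.
beta_sexM <- 0.9851
or_sexM <- exp(beta_sexM)
round(or_sexM, 2)                   # 2.68, matching the OR table above
```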
Conclusion:
Compared with the relationships expected from the box and bar plots, the quality of the family relationship and the willingness to pursue higher education do not correlate significantly with alcohol consumption in this model. Male sex and going out are significant risk factors for high alcohol consumption.
probabilities <- predict(m, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65445026 0.05235602 0.70680628
## TRUE 0.14921466 0.14397906 0.29319372
## Sum 0.80366492 0.19633508 1.00000000
The prediction missed the 14.9 % of observations that were true cases of high alcohol use (predicted FALSE, actually TRUE) and classified 5.2 % as high_use that were not.
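These figures can be recomputed directly from the proportion table above; a small sketch:

```r
# Rows: actual high_use; columns: prediction (proportions from the table above).
tab <- matrix(c(0.65445026, 0.14921466,   # predicted FALSE
                0.05235602, 0.14397906),  # predicted TRUE
              nrow = 2,
              dimnames = list(high_use = c("FALSE", "TRUE"),
                              prediction = c("FALSE", "TRUE")))
accuracy <- tab["FALSE", "FALSE"] + tab["TRUE", "TRUE"]
error <- 1 - accuracy
round(error, 3)                     # 0.202, the training error
```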
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2015707
The predictions of the logistic regression model deviate from the actual values in about 20 % of the cases (the training error).
library(boot)
m <- glm(high_use ~ famrel + goout + higher + sex, data = alc, family = "binomial")
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)
cv$delta
## [1] 0.2251309 0.2193745
The training error of the model (20.2 %) was lower than the 10-fold cross-validation prediction error (about 22.5 %), as expected, since cross-validation estimates performance on unseen data.
This week clustering and classification are central in this course.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
In the “Boston” data set there are 506 observations of the 14 variables listed above.
pairs(Boston)
This plot is not very helpful for interpreting the data, but it shows some correlative relationships. A clearer statement about correlations is made based on the correlation plot and table below.
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The summary shows some clearly non-normally distributed variables, such as crim, zn, rad, age, tax, indus and black. The other variables may be close enough to normal to treat parametrically; in case of doubt a test for normality could be performed.
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v tibble 1.3.4 v purrr 0.2.4
## v readr 1.1.1 v stringr 1.2.0
## v tibble 1.3.4 v forcats 0.2.0
## -- Conflicts ----------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x MASS::select() masks dplyr::select()
library(corrplot)
## corrplot 0.84 loaded
cor_matrix<-cor(Boston) %>% round(digits = 2)
cor_matrix
## crim zn indus chas nox rm age dis rad tax
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47
## ptratio black lstat medv
## crim 0.29 -0.39 0.46 -0.39
## zn -0.39 0.18 -0.41 0.36
## indus 0.38 -0.36 0.60 -0.48
## chas -0.12 0.05 -0.05 0.18
## nox 0.19 -0.38 0.59 -0.43
## rm -0.36 0.13 -0.61 0.70
## age 0.26 -0.27 0.60 -0.38
## dis -0.23 0.29 -0.50 0.25
## rad 0.46 -0.44 0.49 -0.38
## tax 0.46 -0.44 0.54 -0.47
## ptratio 1.00 -0.18 0.37 -0.51
## black -0.18 1.00 -0.37 0.33
## lstat 0.37 -0.37 1.00 -0.74
## medv -0.51 0.33 -0.74 1.00
corrplot(cor_matrix, method="circle", type = "upper", cl.pos= "b", tl.pos = "d", tl.cex = 0.6)
There are negative correlations between dis and age, nox, indus and tax, suggesting that further from the employment centres the buildings are younger, there is less nitrogen oxide pollution, the proportion of non-retail business acres is lower and the full-value property-tax rate is lower. The correlation between medv and lstat is strongly negative, meaning that areas with a lower socioeconomic status of the population have cheaper owner-occupied houses and vice versa.
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
The data is standardized to the values above to eliminate the effect of the variables' different scales on the following analyses. Next, the data is divided into a training set and a test set.
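What scale() does, column by column, is simply centring and dividing by the standard deviation; a minimal sketch:

```r
# scale() standardizes each column to mean 0 and sd 1: (x - mean(x)) / sd(x)
x <- c(10, 20, 30, 40)
manual <- (x - mean(x)) / sd(x)
all.equal(manual, as.vector(scale(x)))   # TRUE
```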
# create a quantile vector of crim
bins <- quantile(boston_scaled$crim)
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, labels = c("low", "med_low", "med_high", "high"), include.lowest = TRUE)
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2549505 0.2500000 0.2524752 0.2425743
##
## Group means:
## zn indus chas nox rm
## low 1.0457215 -0.9235255 -0.08120770 -0.9004131 0.43925451
## med_low -0.1058035 -0.3271988 -0.07742312 -0.5479747 -0.18237434
## med_high -0.3678568 0.1269422 0.11366115 0.3828500 0.06706428
## high -0.4872402 1.0171960 -0.03128211 1.0587241 -0.41217864
## age dis rad tax ptratio
## low -0.9574668 0.9195948 -0.7008870 -0.7197531 -0.47948452
## med_low -0.3068077 0.3539912 -0.5463634 -0.5065695 -0.02519394
## med_high 0.4075183 -0.3555256 -0.4132674 -0.3176230 -0.27924638
## high 0.8327664 -0.8605003 1.6373367 1.5134896 0.77985517
## black lstat medv
## low 0.37660800 -0.78465329 0.53944898
## med_low 0.32406720 -0.10623180 -0.01433236
## med_high 0.08484668 0.06192789 0.13618284
## high -0.92585232 0.83742452 -0.62521346
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08342340 0.688240051 -0.844109247
## indus 0.01822019 -0.246199139 0.292168589
## chas -0.09601362 0.009676212 0.005703292
## nox 0.38425826 -0.659012846 -1.361175543
## rm -0.11258068 -0.061224023 -0.208774726
## age 0.24475360 -0.367481436 -0.047998587
## dis -0.09416804 -0.241182993 0.151987479
## rad 3.15701200 0.911645049 0.229121603
## tax -0.00246475 0.092294848 0.246624091
## ptratio 0.11789816 -0.012784553 -0.209833227
## black -0.15117044 0.013600403 0.173207965
## lstat 0.23222072 -0.298119900 0.300985085
## medv 0.21578267 -0.417242271 -0.190862907
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9482 0.0400 0.0118
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit , dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In this example the LD1-model explains 95 % of the clustering with the index of accessibility to radial highways (rad) as the most influential linear separator.
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 12 9 3 0
## med_low 7 13 5 0
## med_high 0 7 16 1
## high 0 0 0 29
The model predicts the true values of the categorical crime rate well. Most of the wrong predictions were assigned to the med_low crime class.
#reload Boston and scale
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
# scaled distance matrix
dist_scaled <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_scaled)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_scaled, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(dist_scaled, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled, col = km$cluster)
According to the total within-cluster sum of squares curve, the optimal number of clusters for the k-means algorithm is 2 (the point where the curve drops sharply). In the pairs plot above, the variable pairs with strong correlations show, for most variables, a division into the two clusters found by k-means.
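Note that kmeans() starts from random centres, so results can vary between runs; a sketch of making the clustering reproducible (the seed value is arbitrary, not from the original analysis):

```r
library(MASS)                       # for the Boston data
boston_scaled <- as.data.frame(scale(Boston))
set.seed(13)                        # fix the random start of kmeans
km <- kmeans(dist(boston_scaled), centers = 2)
table(km$cluster)                   # cluster sizes are now reproducible
```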
km <-kmeans(dist_scaled, centers = 4)
lda.fit <- lda(km$cluster~., data = boston_scaled)
lda.fit
## Call:
## lda(km$cluster ~ ., data = boston_scaled)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.2272727 0.2114625 0.1304348 0.4308300
##
## Group means:
## crim zn indus chas nox rm
## 1 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011
## 2 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426
## 3 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469
## 4 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814
## age dis rad tax ptratio black
## 1 0.7716696 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546
## 2 -0.6949417 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729
## 3 0.8575386 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038
## 4 -0.3256000 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644
## lstat medv
## 1 0.6116223 -0.54636549
## 2 -0.8200275 1.11919598
## 3 1.1930953 -0.81904111
## 4 -0.2813666 -0.01314324
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.18113078 -0.5012256 -0.60535205
## zn -0.43297497 -1.0486194 0.67406151
## indus -1.37753200 0.3016928 1.07034034
## chas 0.04307937 -0.7598229 -0.22448239
## nox -1.04674638 -0.3861005 -0.33268952
## rm 0.14912869 -0.1510367 0.67942589
## age 0.09897424 0.0523110 0.26285587
## dis -0.13139210 -0.1593367 -0.03487882
## rad -0.65824136 0.5189795 0.48145070
## tax -0.28903561 -0.5773959 0.10350513
## ptratio -0.22236843 0.1668597 -0.09181715
## black 0.42730704 0.5843973 0.89869354
## lstat -0.24320629 -0.6197780 -0.01119242
## medv -0.21961575 -0.9485829 -0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
plot(lda.fit, dimen = 2, col = km$cluster, pch = km$cluster)
lda.arrows(lda.fit, myscale = 3)
In this example the LD1-model explains only 76 % of the clustering with rad, nox, indus and zn as the most influential linear separators. K-mean algorithm with 4 centers isn’t a better model to cluster this data.
model_predictors <- dplyr::select(train, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
#k-means
dist_train <- dist(train)
## Warning in dist(train): NAs introduced by coercion
km <-kmeans(dist_train, centers = 2)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
#Matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
library("plotly")
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Create a 3D plot of the columns of the matrix product
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color=km$cluster)
We can see that the high and med_high crime classes from the LDA-trained set overlap with one of the two k-means clusters, while the other k-means cluster overlaps with the low and med_low crime classes.
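The overlap described above can be quantified with a cross-tabulation of the crime classes against the k-means clusters. A hedged sketch on the full standardized Boston data (not the exact train split used above, so counts will differ):

```r
library(MASS)  # Boston data

# standardized data and the 4-level crime factor, as in the analysis above
boston <- as.data.frame(scale(Boston))
boston$crime <- cut(boston$crim, breaks = quantile(boston$crim),
                    include.lowest = TRUE,
                    labels = c("low", "med_low", "med_high", "high"))

# two-cluster k-means on the numeric variables (not on a distance matrix)
set.seed(123)
km <- kmeans(scale(Boston), centers = 2)

# rows: crime class, columns: k-means cluster
table(boston$crime, km$cluster)
```

A table where low/med_low observations pile into one column and med_high/high into the other supports the visual overlap seen in the 3D plots.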
This week, dimensionality reduction techniques have been central in this course.
# To be sure that I work with the correct data I started from the new link.
human <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", sep =",", header = TRUE)
dim(human)
## [1] 155 8
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Edu2.FM : num 1.007 0.997 0.983 0.989 0.969 ...
## $ Labo.FM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ Edu.Exp : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ Life.Exp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ Mat.Mor : int 4 6 6 5 6 7 9 28 11 8 ...
## $ Ado.Birth: num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ Parli.F : num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
In the “human” data file there are 155 countries with the following 8 variables:
GNI : Gross National Income per capita
Life.Exp : Life expectancy at birth
Edu.Exp : Expected years of schooling
Mat.Mor : Maternal mortality ratio
Ado.Birth : Adolescent birth rate
Parli.F : Percentage of female representatives in parliament
Edu2.FM : ratio of female and male populations with at least secondary education (Edu2.F / Edu2.M)
Labo.FM : ratio of female and male labour force participation rates (Labo.F / Labo.M)
library(GGally)
library(corrplot)
library(dplyr)
# visualize the 'human' variables
ggpairs(human)
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot
In the plots above, the strongest positive correlations are between life expectancy, expected years of schooling and GNI, suggesting that better education goes together with longer lives and a stronger economy. Maternal mortality and adolescent birth rate, on the other hand, correlate negatively with life expectancy and expected years of schooling, suggesting that low levels of national education are associated with more adolescent mothers, higher maternal mortality and lower life expectancy.
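The visual reading of the corrplot can be backed up numerically by listing the strongest pairwise correlations. A minimal sketch, re-reading the same data file:

```r
human <- read.table(
  "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
  sep = ",", header = TRUE)

cm <- cor(human)
cm[upper.tri(cm, diag = TRUE)] <- NA          # keep each variable pair once
cors <- na.omit(as.data.frame(as.table(cm)))  # long format: Var1, Var2, Freq

# five strongest correlations by absolute value
cors[order(-abs(cors$Freq)), ][1:5, ]
</code>
```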
The variables “Edu2.FM”, “Labo.FM”, “Life.Exp”, “Edu.Exp” and “Parli.F” appear roughly symmetrically distributed, as their means and medians are close to each other. “GNI”, “Mat.Mor” and “Ado.Birth”, on the other hand, are clearly skewed.
# print out summaries of the (unstandardized) variables
summary(human)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :0.1717 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.:0.7264 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median :0.9375 Median :0.7535 Median :13.50 Median :74.20
## Mean :0.8529 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.:0.9968 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :1.4967 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI Mat.Mor Ado.Birth Parli.F
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)
library(dplyr)
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 100 0 0 0 0 0 0 0
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## Edu2.FM Labo.FM Edu.Exp Life.Exp
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7378 Min. :-2.7188
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6782 1st Qu.:-0.6425
## Median : 0.3503 Median : 0.2316 Median : 0.1140 Median : 0.3056
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.7126 3rd Qu.: 0.6717
## Max. : 2.6646 Max. : 1.6632 Max. : 2.4730 Max. : 1.4218
## GNI Mat.Mor Ado.Birth Parli.F
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
library(dplyr)
# create and print out a summary of pca_human
s <- summary(pca_human)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the percentages of variance
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1] , ylab = pc_lab[2])
The PCA plot of the unscaled data suggests that a single variable explains virtually all of the variance. After scaling, the variables are on a comparable footing and the plot becomes interpretable.
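The column variances make it clear why the unscaled PCA is dominated by one variable. A minimal sketch, re-reading the same data file:

```r
human <- read.table(
  "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
  sep = ",", header = TRUE)

# GNI's variance is orders of magnitude larger than the rest,
# so unscaled PCA puts essentially all of PC1 on GNI
sort(sapply(human, var), decreasing = TRUE)
```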
Adolescent birth rate and maternal mortality point in the direction opposite to GNI per capita, education level and life expectancy in this plane; together these variables explain 53.6 % of the variation between the countries (PC1), suggesting a fairly clear division along this axis. Stated plainly: good economy, education and life expectancy in a country go together with low adolescent pregnancy and maternal death rates.
Along PC2, a high percentage of female representation in parliament and in the labour market explains another 16.2 % of the differences between the countries.
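The interpretation of the two axes can be made explicit by inspecting the PC1/PC2 loadings. A minimal sketch (note that the signs of loadings are arbitrary up to a flip; only the relative signs within a column matter):

```r
human <- read.table(
  "http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt",
  sep = ",", header = TRUE)

pca_human <- prcomp(scale(human))

# variables with the same sign in a column point in the same direction
# in the biplot; opposite signs point in opposite directions
round(pca_human$rotation[, 1:2], 2)
```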
library(ggplot2)
library(dplyr)
library(tidyr)
library(FactoMineR)
# column names to keep in the dataset
data(tea)
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")
# select the 'keep_columns' to create a new dataset
tea_time <- select(tea, one_of(keep_columns))
# look at the summaries and structure of the data
summary(tea_time)
## Tea How how sugar
## black : 74 alone:195 tea bag :170 No.sugar:155
## Earl Grey:193 lemon: 33 tea bag+unpackaged: 94 sugar :145
## green : 33 milk : 63 unpackaged : 36
## other: 9
## where lunch
## chain store :192 lunch : 44
## chain store+tea shop: 78 Not.lunch:256
## tea shop : 30
##
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)
# summary of the model
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.279 0.261 0.219 0.189 0.177 0.156
## % of var. 15.238 14.232 11.964 10.333 9.667 8.519
## Cumulative % of var. 15.238 29.471 41.435 51.768 61.434 69.953
## Dim.7 Dim.8 Dim.9 Dim.10 Dim.11
## Variance 0.144 0.141 0.117 0.087 0.062
## % of var. 7.841 7.705 6.392 4.724 3.385
## Cumulative % of var. 77.794 85.500 91.891 96.615 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.298 0.106 0.086 | -0.328 0.137 0.105 | -0.327
## 2 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 3 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 4 | -0.530 0.335 0.460 | -0.318 0.129 0.166 | 0.211
## 5 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 6 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 7 | -0.369 0.162 0.231 | -0.300 0.115 0.153 | -0.202
## 8 | -0.237 0.067 0.036 | -0.136 0.024 0.012 | -0.695
## 9 | 0.143 0.024 0.012 | 0.871 0.969 0.435 | -0.067
## 10 | 0.476 0.271 0.140 | 0.687 0.604 0.291 | -0.650
## ctr cos2
## 1 0.163 0.104 |
## 2 0.735 0.314 |
## 3 0.062 0.069 |
## 4 0.068 0.073 |
## 5 0.062 0.069 |
## 6 0.062 0.069 |
## 7 0.062 0.069 |
## 8 0.735 0.314 |
## 9 0.007 0.003 |
## 10 0.643 0.261 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## black | 0.473 3.288 0.073 4.677 | 0.094 0.139
## Earl Grey | -0.264 2.680 0.126 -6.137 | 0.123 0.626
## green | 0.486 1.547 0.029 2.952 | -0.933 6.111
## alone | -0.018 0.012 0.001 -0.418 | -0.262 2.841
## lemon | 0.669 2.938 0.055 4.068 | 0.531 1.979
## milk | -0.337 1.420 0.030 -3.002 | 0.272 0.990
## other | 0.288 0.148 0.003 0.876 | 1.820 6.347
## tea bag | -0.608 12.499 0.483 -12.023 | -0.351 4.459
## tea bag+unpackaged | 0.350 2.289 0.056 4.088 | 1.024 20.968
## unpackaged | 1.958 27.432 0.523 12.499 | -1.015 7.898
## cos2 v.test Dim.3 ctr cos2 v.test
## black 0.003 0.929 | -1.081 21.888 0.382 -10.692 |
## Earl Grey 0.027 2.867 | 0.433 9.160 0.338 10.053 |
## green 0.107 -5.669 | -0.108 0.098 0.001 -0.659 |
## alone 0.127 -6.164 | -0.113 0.627 0.024 -2.655 |
## lemon 0.035 3.226 | 1.329 14.771 0.218 8.081 |
## milk 0.020 2.422 | 0.013 0.003 0.000 0.116 |
## other 0.102 5.534 | -2.524 14.526 0.197 -7.676 |
## tea bag 0.161 -6.941 | -0.065 0.183 0.006 -1.287 |
## tea bag+unpackaged 0.478 11.956 | 0.019 0.009 0.000 0.226 |
## unpackaged 0.141 -6.482 | 0.257 0.602 0.009 1.640 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.108 0.410 |
## How | 0.076 0.190 0.394 |
## how | 0.708 0.522 0.010 |
## sugar | 0.065 0.001 0.336 |
## where | 0.702 0.681 0.055 |
## lunch | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")
In the plot above we can observe the first two dimensions (Dim 1 and Dim 2), which explain 15.2 % and 14.2 % of the variance respectively, 29.5 % cumulatively. Beyond these there are nine more dimensions that account for the remaining variance between the variables.
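The eigenvalue table behind the plot can be pulled out of the MCA object directly, which is handy for deciding how many dimensions are worth interpreting. A minimal sketch, repeating the tea setup above:

```r
library(FactoMineR)  # MCA() and the tea data

data(tea)
tea_time <- tea[, c("Tea", "How", "how", "sugar", "where", "lunch")]

mca <- MCA(tea_time, graph = FALSE)

# percentage of variance per dimension, as in the summary's eigenvalue table
round(mca$eig[, "percentage of variance"], 1)
```

The percentages decline gradually here (no sharp elbow), which is typical for MCA on survey data: each dimension carries only a modest share of the total variance.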